File creating method for searching of data, searching method of data file and managing system for searching of data file

ABSTRACT

A method for creating and storing a file that facilitates searching of data stored in a storage medium, and a data search method using the same, are disclosed. The file creating method creates a rack of virtual RAM (RVR) file in which the data is divided into individual divisional units, and a record allocation table (RAT) file that stores a record position of each divisional unit of the RVR file. As a result, a database (DB) of large-volume irregular data can be easily created, and data analysis can be quickly achieved.

CROSS REFERENCE TO PRIOR APPLICATIONS

The present application is a Divisional Application of co-pending U.S. patent application Ser. No. 13/003,649 (filed on Mar. 21, 2011) under 35 U.S.C. §120, which is a National Stage Patent Application of International Patent Application No. PCT/KR2009/003790 (filed on Jul. 10, 2009) under 35 U.S.C. §371, which claims priority to Korean Patent Application No. 10-2008-0067778 (filed on Jul. 11, 2008), which are all hereby incorporated by reference in their entirety.

BACKGROUND

The present invention relates to a method for creating and storing a file that enables easier searching and a method for searching for data using the same.

FIG. 1 is a conceptual diagram illustrating data stored in a general hard disk. The hard disk constructs a cylinder composed of a plurality of tracks constructing a platter, and performs input/output (I/O) operations through a read/write head connected to an arm for each track. In FIG. 1, it is assumed that the smallest data unit (i.e., record) is stored in each of the 1^(st), 2^(nd), 3^(rd), 4^(th), . . . i−1^(th), i^(th), and N^(th) sectors. The term ‘cluster’ means a set of neighboring sectors. A file manager may arrange a cluster and a physical position using a File Allocation Table (FAT).

In the FAT system, records are sequentially arranged in a plurality of clusters. In order to search for record information of an i-th sector located in an intermediate stage, the FAT system sequentially processes tracks from a first sector to the i-th sector, and finally arrives at the i-th sector, such that it can search for records contained in the first to i-th sectors.

On the other hand, when using a Random Access Memory (RAM), in order to quickly extract necessary information from files including either variables or variable names, it is necessary for all variables to be processed by a Dynamic Random Access Memory (DRAM) in a programming process, such that the RAM can immediately search for a position in which the corresponding variable name is stored. As a result, necessary information can be quickly found in RAM.

However, as DRAM capacity increases, the price of a DRAM serving as a semiconductor material rapidly increases as compared to a hard disk, resulting in a reduction of the cost efficiency of large amount of data that requires more than 128 Gigabytes. Therefore, in order to store large amounts of data, hard disks have been more widely used than DRAMs throughout the world.

Therefore, disc formats of the conventional art have the following disadvantages.

In other words, when using a sequential access method in the same manner as in a disc to search through large amounts of stored data, the access speed geometrically varies with the size of data as compared to a random access speed of a data record.

In addition, provided that the conventional art pre-calculates random access addresses (highly integrated indexes) of all data records and does not store the calculated addresses in external storage, the access speed geometrically changes with the data size.

Specifically, in recent times, with the increasing development of biotechnology, large amounts of genetic, clinical, and gene function-related data such as genomics or omics data (large capacity biological information) have been accumulated, and researchers can extract useful information through calculation using the resultant data. The size of each irregular data set is about several to tens of terabytes, and it is expected that the size of each irregular data set will reach petabytes during the execution of a greater project. In this case, the difference in data access time between the sequential access method and the random access method based on the highly integrated index technology may be several days to several years, such that the conventional art will be incapable of implementing data access or data search.

SUMMARY OF THE INVENTION

Accordingly, the present invention is directed to a method for creating a file for data search, a method for searching for a data file, and a database management system for searching for the data file, that substantially obviate one or more problems due to limitations and disadvantages of the related art.

It is an object of the present invention to provide a method for constructing a Record Allocation Table (RAT) for a variety of records of all constituent units (i.e., page, paragraph, line, word, string, integer, and float) of a large amount of data, performing random access of position information (i.e., address) on a hard disk, implementing a database management system (DBMS) for large volumes of irregular data, and allowing a hard disk to search for data as quickly as in a DRAM.

It is another object of the present invention to provide a method for analyzing and calculating large volumes of data, allowing a huge amount of data not to be processed in a DRAM (DRAMs of more than 128 gigabytes are very expensive, resulting in a reduction in practical use), and controlling the huge amount of data to be processed in a hard disk at a speed similar to a DRAM access speed.

It is yet another object of the present invention to provide a data processing method for quickly and effectively searching a large file, thereby facilitating intensive research into clustering of large amounts of data.

In accordance with the present invention, the above and other objects can be accomplished by the provision of a file creation method for searching for a single irregular data file, the method including: (A1) receiving a divisional unit of data as an input; (A2) discriminating the single irregular data file using the received divisional unit, and creating a rack of virtual RAM (RVR) file; (A3) detecting a record position for each divisional unit of the RVR file, and creating a record allocation table (RAT) file; and (A4) storing the RVR file and the RAT file.

The data may be irregular data, and the divisional unit may be any one of [page], [paragraph], [line] and [word].

In accordance with another aspect of the present invention, a file creation method for searching for a single regular data file includes: (B1) discriminating between a row and a column of one regular data, and creating a rack of virtual RAM (RVR) file; (B2) detecting a record position for each row or column of the RVR file, and creating a record allocation table (RAT) file; and (B3) storing the RVR file and the RAT file.

The record position may be the size of data accumulated in the single data extended up to a specific position where corresponding data is recorded.

The record position may be a number of a hard disk cluster in which data of a corresponding part is recorded.

In accordance with another aspect of the present invention, a method for searching for a single data file includes: (C1) receiving search information; (C2) detecting a record position contained in single data corresponding to searched information from a record allocation table (RAT) file; (C3) detecting a physical storage position contained in a storage medium of data corresponding to the searched information from the record position; and (C4) searching for data of a physical position of the data, and outputting the searched result.

If the single data is irregular data, the searched information may be an order of each divisional unit.

If the single data is regular data, the searched information may be a number of a row or column of corresponding data from among the regular data.

The record position may be the size of data accumulated in the single data extended up to a specific position where corresponding data is recorded.

The detecting step (C3) of the storage position may include calculating a cluster position from the record position using a size of data of each divisional unit, reading a physical storage position of the cluster position from a file allocation table (FAT), and detecting the read physical storage position of the cluster position.

The record position may be a number of a hard disk cluster in which data of a corresponding part is recorded.

In accordance with another aspect of the present invention, a system for managing a database (DB) to search for a data file includes: a database (DB) for storing a rack of virtual RAM (RVR) file created by discriminating a single input irregular data file using a predetermined divisional unit, and a record allocation table (RAT) file created by detecting a record position for each divisional unit of the RVR file; a rack of virtual RAM (RVR) controller for detecting a record position of searched information from the RAT file in association with input searched information, detecting a physical storage position contained in a storage medium of data corresponding to the searched information from the record position, searching for data of the physical position, and reading the searched data; and an analysis module for analyzing a result read by the RVR controller.

The divisional unit may be any one of [pagen], [page], [fastan], [fasta], [line], [image], [audio] and [video].

The searched information may be an order of each divisional unit.

The record position may be the size of data accumulated in the single data extended up to a specific position where corresponding data is recorded.

The record position may be a number of a hard disk cluster in which data of a corresponding part is recorded.

The storage medium may be a semiconductor storage medium.

In accordance with another aspect of the present invention, a system for managing a database (DB) to search for a data file includes: a database (DB) for storing a rack of virtual RAM (RVR) file created by discriminating a single input data file using a regular divisional unit based on a row and column, and a record allocation table (RAT) file created by detecting a record position for each divisional unit of the RVR file; a rack of virtual RAM (RVR) controller for detecting a record position of searched information from the RAT file in association with input searched information, detecting a physical storage position contained in a storage medium of data corresponding to the searched information from the record position, searching for data of the physical position, and reading the searched data; and an analysis module for analyzing a result read by the RVR controller.

The divisional unit of the regular data may be any one of [seq], [int], [float], [string], [csv], [r], [xml] and [smtx].

The searched information may be a row number or a column number of corresponding data from among the regular data.

The record position may be the size of data accumulated in the single data extended up to a specific position where corresponding data is recorded.

The detecting of the storage position may include calculating a cluster position from the record position using a size of data of each divisional unit, reading a physical storage position of the cluster position from a file allocation table (FAT), and detecting the read physical storage position of the cluster position.

The record position may be a number of a hard disk cluster in which data of a corresponding part is recorded.

In accordance with another aspect of the present invention, a method for searching for a data file includes: (D1) fragmenting a genome base sequence into predetermined-sized base units; (D2) allocating a unique number to each fragmented base unit; (D3) storing a storage position of each base unit; and (D4) creating a record allocation table (RAT) file.

The unique number in the step (D2) may be a quaternary (base-4) number classified according to the bases constructing the base unit.
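By way of a non-limiting illustration, steps (D1) and (D2) may be sketched in Python as follows. The sketch assumes non-overlapping base units and a hypothetical digit assignment (A=0, C=1, G=2, T=3); neither choice is fixed by the description above, and the function names are illustrative only.

```python
# Hypothetical base-to-digit mapping; the description does not fix an assignment.
BASE_DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def fragment(sequence, unit_size):
    """(D1) Split a base sequence into consecutive, non-overlapping base units."""
    return [sequence[i:i + unit_size] for i in range(0, len(sequence), unit_size)]


def unit_number(unit):
    """(D2) Read a base unit as a base-4 number, most significant base first."""
    value = 0
    for base in unit:
        value = value * 4 + BASE_DIGITS[base]
    return value


units = fragment("ACGTTTGA", 4)
numbers = [unit_number(u) for u in units]
```

Under these assumptions every base unit of a given length maps to a distinct integer, so the number itself can serve as a lookup key for the stored position of the unit.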

In accordance with another aspect of the present invention, a method for searching for a data file includes: (E1) assigning a serial number to input data according to a divisional unit; (E2) calculating a serial number of data that includes a word contained in the input data; (E3) creating a hash table including the word and the serial number; and (E4) creating a record allocation table (RAT) using the hash table.
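As a minimal sketch of steps (E1) to (E3), the hash table may be pictured as a Python dictionary that maps each word to the serial numbers of the divisional units containing it. The function name `build_word_table`, the use of a line break as the divisional factor, and whitespace tokenization are assumptions made for illustration; step (E4) is not shown.

```python
from collections import defaultdict


def build_word_table(text, divisional_factor="\n"):
    """Map each word to the serial numbers of the units that contain it."""
    table = defaultdict(list)
    units = text.split(divisional_factor)
    # (E1) serial numbers start at 1 and follow the order of the units.
    for serial, unit in enumerate(units, start=1):
        # (E2)/(E3) record each distinct word of the unit once.
        for word in sorted(set(unit.split())):
            table[word].append(serial)
    return dict(table)


table = build_word_table("gene a binds gene b\ngene c is silent\nprotein a binds")
```

Here `table["gene"]` lists the units in which the word "gene" occurs, so a word query can be resolved to serial numbers before any record is read.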

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects, features and other advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a conceptual diagram illustrating data stored in a general hard disk.

FIG. 2 exemplarily shows the relationship among a data file, a RAT file, and an RVR file according to the embodiments of the present invention.

FIG. 3 exemplarily shows the relationship among an RVR file, a RAT file, and data stored in a disc according to the embodiments of the present invention.

FIG. 4 exemplarily shows the relationship between a data file and a RAT file according to the embodiments of the present invention.

FIGS. 5 and 6 show examples for creating a RAT file and an RVR file from a data file when stored data of the present invention is a general document.

FIGS. 7 and 8 show examples for creating a RAT file and an RVR file from a data file when stored data of the present invention is a matrix.

FIG. 9 exemplarily shows a program and source code for performing record and access functions of RVR and RAT files according to the embodiments of the present invention.

FIG. 10 is a flowchart illustrating a method for creating an RVR file and a RAT file according to the embodiments of the present invention.

FIG. 11 is a flowchart illustrating a method for searching for data according to the embodiments of the present invention.

FIGS. 12 and 13 show the result of comparison between a data access speed of the present invention and a sequential data access speed of a general hard disk.

FIG. 14 is a block diagram illustrating an RVR DBMS according to the present invention.

FIG. 15 exemplarily shows divisional units for classifying irregular data by an RVR DBMS according to the present invention.

FIG. 16 exemplarily shows divisional units for classifying regular data by an RVR DBMS according to the present invention.

FIG. 17 exemplarily shows a method for adding, deleting, updating, and searching for data by an RVR DBMS according to the present invention.

FIG. 18 is a flowchart illustrating a method for creating an RVR file and a RAT file of base sequence data by an RVR DBMS according to the present invention.

FIG. 19 exemplarily shows a method for creating an RVR file and a RAT file of base sequence data by an RVR DBMS according to the present invention.

FIG. 20 is a flowchart illustrating a method for creating an RVR file and a RAT file of large-scale abstract data by an RVR DBMS according to the present invention.

FIG. 21 is a conceptual diagram illustrating a method for creating an RVR file and a RAT file of large-scale abstract data by an RVR DBMS according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings. In the drawings, the same or similar elements are denoted by the same reference numerals even though they are depicted in different drawings. In the following description, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear. Exemplary embodiments of the present invention provide a method for recording data in a disc and a method for searching for data in a disc.

FIG. 2 exemplarily shows the relationship among a data file, a Record Allocation Table (RAT) file, and a Rack of Virtual RAM (RVR) file according to the embodiments of the present invention. FIG. 3 exemplarily shows the relationship among an RVR file, a RAT file, and data stored in a disc according to the embodiments of the present invention. FIG. 4 exemplarily shows the relationship between a data file and a RAT file according to the embodiments of the present invention.

Referring to FIGS. 2 to 4, data according to the present invention is stored as an RVR file format, and a RAT file acting as a dynamic table of the RVR record is generated and stored.

In other words, if a user attempts to store an arbitrary data file, a data file is converted into an RVR file, and a RAT file is generated, such that the RVR file and the RAT file are stored in a hard disk.

In this case, the RVR file is generated by including a divisional factor in a data file. In addition, the divisional factor is adapted to discriminate data for each divisional unit serving as a recording unit of data. The divisional unit may be established in various ways, for example, [paragraph], [line], [word], [string], [integer], or [float], etc.

Type and function of the divisional factor (divisional unit) will hereinafter be described with reference to a method for creating the RVR file and the RAT file.

The RAT file stores a dynamic table that indicates the position of each recording unit, and indicates the position of specific data in the RVR file during the data searching operation.

Referring to FIG. 3, it is assumed that the smallest data unit of a hard disk serving as a data storage unit is stored in each of first, second, third, fourth, . . . i−1^(th), i^(th), and N^(th) sectors. A cluster is a set of sectors, and is used as a record unit of data.

A file manager serving as a file management program arranges a cluster and a physical position according to a File Allocation Table (FAT), such that it can store a file.

However, a plurality of clusters may be required to store one file, and the clusters are not allocated in regular order. In other words, the file manager searches for a recordable cluster and stores a file corresponding to the searched cluster. The order of clusters used for recording the file is recorded in the FAT. While the file is reproduced (searched), the order of clusters is read and data can be read, such that the file can be reproduced or searched.

That is, as shown in FIGS. 2 to 4, individual physical cluster positions are stored according to a series of cluster numbers.

Meanwhile, the RAT file for the above data is a data file distinguished by a divisional factor for each divisional unit. FIG. 3 shows an exemplary text divided into line units. In accordance with the embodiment of the present invention, stored data of the present invention is a general document.

The created RAT file stores a serial number (line number in FIG. 3) for indicating the order of divisional factors and a record position (i.e., address) where data corresponding to the serial number is recorded.

In this case, the address serving as the record position may be represented by the size of accumulated data.

That is, the address serving as the record position can be represented by the following equation 1.

address[k]=(k−1)*bytes_of_record  [Equation 1]

In this case, assuming that all records use the same number of bytes, ‘bytes_of_record’ is a constant decided according to hard disk characteristics.

Therefore, provided that the record position is divided by the constant (bytes_of_record), the cluster number (i′) can be recognized, such that the record position of physical data can be recognized through FAT.
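The relationship of Equation 1 and its inverse may be illustrated by the following minimal Python sketch, which reads the index of Equation 1 as the 1-based serial number k of the record. The value 512 for `bytes_of_record` and the function names are assumptions chosen for the example only.

```python
BYTES_OF_RECORD = 512  # assumed constant record size in bytes


def address(k, bytes_of_record=BYTES_OF_RECORD):
    """Record position of the k-th record (1-based), as accumulated bytes."""
    return (k - 1) * bytes_of_record


def number_from_address(addr, bytes_of_record=BYTES_OF_RECORD):
    """Divide the record position by the constant, as described in the text."""
    return addr // bytes_of_record


fourth = address(4)
recovered = number_from_address(fourth)
```

Because both directions are a single multiplication or division, no scan over preceding records is needed to locate a record of known serial number.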

Meanwhile, provided that stored data of the present invention is configured in the form of a matrix (table) and all records (record units) of the matrix use the same number of bytes, a record address of a specific position (k) on the matrix is obtained by the partitioning of Equation 1.

The partitioning result of Equation 1 can be represented by the following equation 2.

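address[k]=(x−1)*bytes_of_record+(y−1)*bytes_of_record*N  [Equation 2]

In Equation 2, k is a serial number of a matrix record unit, x and y are the positions of the record unit on the X axis and the Y axis, respectively, and N is the number of divisional factors on an X axis.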
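A worked instance of Equation 2 may be sketched in Python as follows, under the assumption that x and y are 1-based positions on the X and Y axes and that records are laid out row by row. The function name and the sample values are illustrative only.

```python
def matrix_address(x, y, n, bytes_of_record):
    """Equation 2: byte offset of the record at column x, row y (both 1-based)."""
    return (x - 1) * bytes_of_record + (y - 1) * bytes_of_record * n


# Third column of the second row in a matrix with N = 5 columns of 8-byte records.
offset = matrix_address(x=3, y=2, n=5, bytes_of_record=8)

# The same record has serial number k = (y - 1) * N + x under row-by-row layout.
k = (2 - 1) * 5 + 3
```

With these values the offset is 56 bytes, which equals (k − 1) × 8 for k = 8, so Equation 2 agrees with the single-index form when all records have the same size.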

Provided that bytes of respective records are different from one another, the following equation 3 can be obtained.

$$\mathrm{address}[k] = \sum_{i=2}^{k} \mathrm{bytes\_of\_record}[i-1] \qquad \text{[Equation 3]}$$

In Equation 3, k is a serial number of a matrix record unit, bytes_of_record[i] indicates bytes of paragraph records, i is a serial number of a specific line or a paragraph record, and address[k=1] is initialized to zero ‘0’.
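Equation 3 amounts to a running total of the record sizes. A minimal Python sketch is given below; the function name `rat_addresses` and the sample sizes are assumptions for illustration.

```python
def rat_addresses(record_sizes):
    """Equation 3: address[1] = 0 and address[k] = sum of the sizes of records 1..k-1."""
    addresses = []
    total = 0
    for size in record_sizes:
        addresses.append(total)  # position where this record starts
        total += size            # accumulate for the next record
    return addresses


sizes = [12, 7, 30, 5]            # bytes of four records of differing length
addresses = rat_addresses(sizes)  # list index 0 corresponds to k = 1
```

The addresses are computed once in a single pass, after which the position of any record is a table lookup rather than a summation.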

FIG. 5 shows an example for creating an RVR file from a data file when stored data of the present invention is a general document. FIG. 6 shows an example for creating a RAT file from a data file when stored data of the present invention is a general document.

Referring to FIG. 5, if stored data is a general document (data and document have the same meaning), there are a variety of divisional units, for example, [paragraph], [line], [word], etc. As can be seen from FIG. 5, divisional units (i.e., [paragraph], [line], and [word]) are applied to the same document such that the RVR file is created.

In this case, a general document is an irregular document in which the document format is irregular. The irregular document may indicate most documents that are not written in a regular format such as a matrix format (e.g., a table). The term ‘general document’ has the same meaning as that of the term ‘irregular document’.

In other words, the first example creates an RVR file using ‘[paragraph]’ as a divisional unit. As shown in the drawing, each paragraph is denoted by a divisional factor ‘>’.

The second example creates an RVR file using ‘[line]’ as a divisional unit. As shown in the drawing, each line is denoted by a divisional factor ‘\n’.

The third example creates an RVR file using ‘[word]’ as a divisional unit. As shown in the drawing, each word is denoted by a divisional factor ‘ ’.

The divisional unit of data is decided according to user selection, and the divisional factor may be replaced with an arbitrary symbol.
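The division of one document by different divisional units may be pictured with the following Python sketch. The table `DIVISIONAL_FACTORS`, the function `split_units`, and the sample text are hypothetical; treating any run of whitespace as the [word] factor is likewise an assumption.

```python
# Divisional factors as named in the text; None stands for "any whitespace",
# which is an assumed reading of the [word] factor.
DIVISIONAL_FACTORS = {"[paragraph]": ">", "[line]": "\n", "[word]": None}


def split_units(text, unit):
    """Divide the text into records according to the chosen divisional unit."""
    factor = DIVISIONAL_FACTORS[unit]
    parts = text.split() if factor is None else text.split(factor)
    return [p for p in parts if p.strip()]  # drop empty records


document = "one two\nthree>four five\nsix"
paragraphs = split_units(document, "[paragraph]")
lines = split_units(document, "[line]")
words = split_units(document, "[word]")
```

The same document thus yields two paragraph records, three line records, or five word records, depending only on the divisional unit selected by the user.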

Meanwhile, as shown in FIG. 6, the RAT file is created from the created RVR file. As described above, the RAT file includes not only a serial number for indicating the order of sequential divisional units of the RVR file but also a record position (address) where the corresponding data is stored.

That is, as shown in FIG. 6, the amount of accumulated data is indicative of a record position.

FIGS. 7 and 8 show examples for creating a RAT file and an RVR file from a data file when data of the present invention is stored in matrix format.

Referring to FIG. 7, if data is stored in matrix format, an additional divisional unit and an additional divisional factor are not present. That is, a row and a column of the matrix are a divisional unit and a divisional factor, respectively.

In this case, the matrix-format document is indicative of a regular document. The matrix-format document has the same meaning as that of the regular document.

In this case, a storage format is classified into [string], [integer] and [float] according to data formats stored in each matrix.

FIG. 7 shows an example of an arbitrary RVR file having a storage format such as [string], [integer] or [float].

In this case, ‘string’ may indicate a storage format in which all kinds of data including character data and numeric data (including a decimal point) can be freely stored in a cell of the matrix.

In addition, ‘integer’ may indicate a storage format in which data stored in a cell of the matrix is an integer variable.

Also, ‘float’ may indicate a storage format in which data stored in a cell of the matrix includes a decimal point.

Meanwhile, as shown in FIG. 8, the RAT file is created from the created RVR file. As described above, the RAT file includes not only a serial number indicating a row number of the matrix-type RVR file but also a record position where the corresponding data is stored.

FIG. 9 exemplarily shows a program and source code for performing record and access functions of RVR and RAT files according to the embodiments of the present invention.

In FIG. 9, a program for executing read/write (R/W) operations of the RVR-RAT is an Indexing RVR (IRVR). FIG. 9 shows actual exemplary sequences of a plurality of records of 6 different data (depending upon a divisional unit or a data format) shown in FIGS. 5 and 7.

Values of respective bytes are calculated in different ways according to respective records and categories of computer operating systems (OSs). Specifically, a file can be converted into a binary file by the ‘fwrite()’ function of the C/C++ computer language, such that the size of each record of individual input files is returned in units of bytes. Therefore, while large data is converted into an RVR file, the byte size of every data record is obtained through the ‘fwrite()’ function, every record address is obtained by Equations 1, 2 and 3, and the result is output and stored as a RAT file.
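The dependence of the record size on the storage format may be illustrated with Python's standard `struct` module in place of the C/C++ ‘fwrite()’ call. The fixed-width little-endian layouts below (4-byte integer, 8-byte float) and the UTF-8 encoding of strings are assumptions; the source does not specify the binary encoding used by the IRVR program.

```python
import struct

# Assumed binary layouts; "<" selects standard sizes independent of the platform.
FORMATS = {"[integer]": "<i", "[float]": "<d"}


def record_bytes(value, storage_format):
    """Number of bytes one record occupies when written in binary form."""
    if storage_format == "[string]":
        return len(str(value).encode("utf-8"))
    return len(struct.pack(FORMATS[storage_format], value))


int_size = record_bytes(42, "[integer]")
float_size = record_bytes(3.14, "[float]")
string_size = record_bytes("ACGT", "[string]")
```

Fixed-width formats give a constant `bytes_of_record`, whereas string records vary in size and therefore require accumulated addresses.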

General users other than experts handling high-level system programming are unable to gain access to information about a FAT-Sector (See FIG. 2). Therefore, a controller is used as an intermediate bridge between FAT-sectors. Likewise, general users who use most high-level computer languages (e.g., Perl, Python, Fortran, C/C++, JAVA, etc.) may use an RVR-RAT that includes a record and record address of a file stored in a hard disk in the same manner as in the FAT-Sector controller.

A method for recording data in a disc and a method for searching for data according to the present invention will hereinafter be described with reference to the method for creating RVR/RAT files and the method for searching for data using the same.

FIG. 10 is a flowchart illustrating a method for creating an RVR file and a RAT file according to the embodiments of the present invention. FIG. 11 is a flowchart illustrating a method for searching for data according to the embodiments of the present invention.

An exemplary case in which stored data is general document data will hereinafter be described with reference to the annexed drawings.

Referring to FIG. 10, in accordance with a method for creating the RAT file and the RVR file, if a user attempts to store data, the system of the present invention receives information of a divisional unit from the user at step S110.

Thereafter, upon receiving a data file, the system inserts into the data file a divisional factor corresponding to the divisional unit indicated by the divisional unit information, so as to create an RVR file at step S120. Needless to say, the divisional factor performs no function during the actual search operation and only facilitates creation of the RAT file, such that it need not be contained in the RVR file.

In addition, the RAT file is created from the RVR file at step S130. The RVR file is divided using the divisional factor, a serial number is assigned to each divisional unit, and a record position of data corresponding to each serial number is recorded, such that the RAT file is created.

If the divisional factor is not contained in the RVR file, the above data is divided by the divisional unit, and at the same time the numbering of the serial number is performed. In addition, the record position of the corresponding data is stored such that the RAT file is created.

In addition, the created RVR and RAT files are stored at step S140.
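Steps S110 to S140 may be sketched in Python as follows. The sketch is illustrative only: the tab-separated text layout of the RAT file, the file names, and the function `create_rvr_and_rat` are assumptions, and the divisional factor is omitted from the RVR file, which the description permits.

```python
import os
import tempfile


def create_rvr_and_rat(data, divisional_factor, rvr_path, rat_path):
    """Write the records to an RVR file and (serial number, address) rows to a RAT file."""
    records = [r for r in data.split(divisional_factor) if r]  # S110/S120
    address = 0
    with open(rvr_path, "wb") as rvr, open(rat_path, "w", encoding="utf-8") as rat:
        for serial, record in enumerate(records, start=1):
            payload = record.encode("utf-8")
            rat.write(f"{serial}\t{address}\n")  # S130: serial number and record position
            rvr.write(payload)                   # S140: store the record itself
            address += len(payload)              # accumulated data size
    return len(records)


workdir = tempfile.mkdtemp()
rvr_path = os.path.join(workdir, "example.rvr")  # hypothetical file names
rat_path = os.path.join(workdir, "example.rat")
count = create_rvr_and_rat("alpha\nbeta\ngamma", "\n", rvr_path, rat_path)
```

For the three sample lines, the RAT file holds the addresses 0, 5, and 9, that is, the number of bytes accumulated before each record.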

A method for searching for data using the RAT file according to the present invention will hereinafter be described with reference to the annexed drawings.

Referring to FIG. 11, in order to search for data using the RAT file of the present invention, the system for use in the present invention receives search information from a user at step S210.

The search information may indicate the order of each divisional unit when data is general data. If data is matrix-type data, the search information may indicate a row number of the matrix.

That is, provided that the divisional unit is [paragraph] and the user attempts to search for the N-th paragraph, the search information is denoted by N. Provided that the divisional unit is [line] and the user attempts to search for the N′-th line, the search information is denoted by N′. Provided that the divisional unit is [word] and the user attempts to search for the N″-th word, the search information is denoted by N″.

Thereafter, the system of the present invention searches for the stored RAT file and reads a record position corresponding to the search information at step S220.

Next, the system calculates a cluster number from the record position (address), and thus calculates a physical cluster position of data from the FAT at step S230.

In order to calculate the cluster number, Equations 1 to 3 can be utilized as previously stated above.

Thereafter, the system reads the physical data storage position of the hard disk and outputs the read result at step S250.
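The search flow may be sketched in Python as follows, with an in-memory byte stream standing in for the RVR file and a list of addresses standing in for the RAT file. In this sketch the translation from record position to physical cluster is left to the file system behind `seek`, which is an assumption about how an implementation might delegate that step; `read_record` is a hypothetical name.

```python
import io


def read_record(rvr, rat, n):
    """Return the n-th record (1-based) by seeking directly to its record position."""
    start = rat[n - 1]                        # record position from the RAT
    end = rat[n] if n < len(rat) else None    # next record's position, if any
    rvr.seek(start)                           # jump without reading earlier records
    return rvr.read() if end is None else rvr.read(end - start)


rvr = io.BytesIO(b"alphabetagamma")  # three records stored back to back
rat = [0, 5, 9]                      # accumulated-byte address of each record
second = read_record(rvr, rat, 2)
last = read_record(rvr, rat, 3)
```

Only the requested bytes are read, so the cost of a lookup does not depend on how many records precede the one being searched for.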

Next, the sequential data access speed of a general hard disk is compared with the data access speed of the present invention.

FIGS. 12 and 13 show the result of comparison between a data access speed of the present invention and a sequential data access speed of a general hard disk.

In this case, the search data is one large-scale data of 192 gigabytes having a dimension denoted by ‘[X:20,000]*[Y:1,000,000]’, where X indicates the presence of 20000 variables, each of which includes a decimal point and Y indicates the presence of one million of [X:20000]. In this data, the sequential access time of each 10^(th), 100^(th), 1000^(th), 10000^(th), 100000^(th), or 1000000^(th) record value of the Y value is compared with a random access time using the RVR-RAT.

The above-mentioned test is carried out on a 64-bit quad-core Xeon CPU under a Fedora 8.0 Linux environment, and the test is performed by the [IRVR] program shown in FIG. 9.

Data located in the frontmost record position has a relatively short access time, whereas the access time for data beyond the 1000000^(th) record increases geometrically (See FIG. 12).

In contrast, the data access time of the present invention remains almost constant irrespective of the record position, and it can be recognized that this constant time, about 0.1 sec, is superior.

Although the method of the present invention requires a considerable time to create the RVR file and the RAT file, the method can very easily search for data after the RVR file and the RAT file are created.

A management system (hereinafter referred to as RVR DBMS) for managing a database (DB) using the above-mentioned file search method will hereinafter be described with reference to the annexed drawings.

FIG. 14 is a block diagram illustrating an RVR DBMS according to the present invention. FIG. 15 exemplarily shows divisional units for classifying irregular data by an RVR DBMS according to the present invention. FIG. 16 exemplarily shows divisional units for classifying regular data by an RVR DBMS according to the present invention. FIG. 17 exemplarily shows a method for adding, deleting, updating, and searching for data by an RVR DBMS according to the present invention.

The RVR DBMS according to the present invention constructs the data file using a set (i.e., RVR file) of data records and a set (i.e., RAT file) of hard disk highly integrated indexes of the data record set, such that it performs the same data management and analysis operations as those of the standard DB management system using the RVR file and the RAT file.

For this operation, as shown in FIG. 14, the RVR DBMS according to thepresent invention includes at least one DB, an RVR controller, and ananalysis module for analyzing data using stored in the DB.

In this case, the DB stores a data file applied to the RVR DBMS, andalso stores the RVR file and the RAT file that are manufactured by theaforementioned processing of the data file. Operations of creating andstoring the RVR file and RAT file have already been described in theafore-described best mode for implementing the present invention.

Therefore, the RVR controller creates and stores the RVR file and theRAT file using the data file, and performs a desired analysis operationusing the stored RVR and RAT files.

Meanwhile, in order to perform the above-mentioned analysis operation,the analysis module includes a programming language library (PLL) and astatistics language library (SLL).

In this case, PLL is an analysis module composed of various computerprogramming languages (Java, Perl, Python, C/C++, etc.), and SLL is ananalysis module composed of various statistical languages (R, SAS, SPSS,etc.). PLL or SLL is used as an analysis module library that directlyreceives a pointer of the RVR DB through a pipe and performs analysisand calculation operations on the received pointer. Therefore, PLL orSLL is an analysis library capable of minimizing a time requisite forI/O operations of the analysis system.

Next, a method for recognizing and discriminating data by the RVR DBMS according to the present invention will hereinafter be described in detail. Each data record defined in the RVR DBMS is always matched with a hard disk highly integrated index. Although this hard disk drive (HDD) address is a physical address of the same kind as a pointer used in the C/C++ languages, it is recognized and used in a different way. In other words, a pointer of the C/C++ languages indicates an absolute address of a given location irrespective of data records. The hard disk address (i.e., pointer) for use in the RVR DBMS, in contrast, starts from the first data record of a given file and is an address relative to that first data record, and a relative record number (RRN) is given to each of the start and end addresses.

Therefore, the pointer for use in the RVR DBMS indicates only the hard disk highly integrated index addresses of individual data records in the set of defined data records.

In addition, the size of a data record may range from a single character (=1 byte) to the whole human genome (=3 GB), or the data record may be configured to have a size larger than the whole human genome (=3 GB).

Specifically, when addresses of all individual bases of the genome sequence are used, a DNA fragment of a predetermined size (e.g., an oligonucleotide of 12 bases) is shifted one base at a time such that fragments of the whole chromosome are constructed. Each constructed fragment consists of the four bases A, C, G, and T, which are assigned the digits 0, 1, 2, and 3, respectively, and each fragment is thereby converted into a tetramal (base-4) number (i.e., is tetramalized) so that a hash table is configured. Conversely, if a number is given, the given number may be applied to a specific function that quickly converts it back into the corresponding 12-base sequence.

In the above-mentioned case, if all bases are converted into records of a predetermined size according to the above-mentioned scheme or a similar scheme, the RVR-RAT for records each having 12 bases is constructed. In addition, a query base sequence for data searching can be read through random access over the whole genome, which is represented as a plurality of records each including 12 bases.
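The tetramal conversion described above can be sketched in Python as follows. The digit assignment A=0, C=1, G=2, T=3 is taken from the description; the function names encode and decode and the choice to treat the first base as the most significant digit are assumptions made for illustration only.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3, as in the description

def encode(fragment):
    """Convert a base fragment into its tetramal (base-4) number."""
    number = 0
    for base in fragment:
        number = number * 4 + BASES.index(base)
    return number

def decode(number, length=12):
    """Convert a tetramal number back into a base fragment of the given length."""
    digits = []
    for _ in range(length):
        number, digit = divmod(number, 4)
        digits.append(BASES[digit])
    return "".join(reversed(digits))
```

Under these assumptions every 12-base fragment maps to a distinct integer between 0 and 4^12−1, so the number can serve directly as a hash key, and decode recovers the fragment from the key.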

The embodiment of the base sequences will hereinafter be described in detail.

In addition, when recognizing and representing a data record, a separator such as a tab (or a comma, white space, line break, the symbol ‘>’, etc.) may be utilized as necessary. That is, the separator may be defined in different ways according to the user's intention.

Various divisional units for use in the RVR DBMS will hereinafter be described in detail.

The divisional unit may also be extended in various ways other than the best mode of the present invention, and more extended divisional units can be established as follows.

In other words, referring to FIGS. 15 and 16, the processors of the RVR DBMS can process irregular data and regular data using a variety of divisional units (for example, [1]pagen, [2]page, [3]fastan, [4]fasta, [5]line, [6]image, [7]audio, [8]video, [9]seq, [10]int, [11]float, [12]string, [13]csv, [14]r, [15]xml, and [16]smtx).

In this case, an irregular-type divisional unit (or a non-table type divisional unit) will hereinafter be described in detail.

-   pagen: Paragraph-type data such as an abstract is converted into a data record, and the line break present in each line is recognized.
-   page: Paragraph-type data such as an abstract is converted into a data record, but a line break is given only to the end of the record.
-   fastan: ‘fastan’ is the same as ‘fasta’, except that it is able to recognize a line break within each data record.
-   fasta: ‘fasta’ is a processor for the fasta format of DNA/protein data, in which the line that holds content such as the ID and description (or annotation) of each record begins with ‘>’. It does not permit white space within a base sequence or amino acid sequence and, in the same manner as the page format, assigns a line break only to the end of a record.
-   line: ‘line’ is a processor that uses the line break of each line as a separator and uses each line as a data record.
-   image: ‘image’ is needed for individual data records of multimedia. ‘image’ is a processor that converts the file formats of various still images (gif, jpeg, bmp, pict, pcx, etc.) into data records.
-   video: ‘video’ is needed for individual data records of multimedia. ‘video’ is a processor that converts various moving picture file formats (mpeg, avi, asf, rm, wmv, etc.) into data records.
-   audio: ‘audio’ is needed for individual data records of multimedia. ‘audio’ is a processor that converts various file formats of data perceptible to human hearing (wav, asf, mp3, ogg, etc.) into data records.
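As one illustration of how a divisional unit delimits records, the following Python sketch locates the start and end offsets of each record in fasta-style data, treating every line that begins with ‘>’ as the start of a new record. The function name index_fasta and the sample data are hypothetical, and line breaks inside a record are kept as they are (closer to the ‘fastan’ unit than to ‘fasta’).

```python
def index_fasta(data):
    """Return (start, end) byte offsets of each record; a record starts at a line beginning with '>'."""
    starts = []
    pos = 0
    for line in data.splitlines(keepends=True):
        if line.startswith(b">"):
            starts.append(pos)
        pos += len(line)
    # Each record ends where the next one starts; the last one ends at the end of the data.
    ends = starts[1:] + [len(data)]
    return list(zip(starts, ends))

# Hypothetical two-record input.
sample = b">seq1 first\nACGT\nGGCC\n>seq2 second\nTTAA\n"
records = index_fasta(sample)
```

The resulting offset pairs play the role of record positions: slicing sample with any pair yields exactly one record, header line included.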

Meanwhile, in the case of regular-type data or table-type data, a hard disk drive (HDD) address is stored in units of a table line. Therefore, the address of each data record is calculated by taking the address of the corresponding line and adding to it the size information accumulated up to the column position of the record in that line.

The regular-type data has the following divisional units.

-   seq: ‘seq’ is a processor that uses each line of a DNA/protein multiple alignment as a data record.
-   int: ‘int’ is a regular-type table data format composed of integers.
-   float: ‘float’ is a regular-type table data format composed of double-precision values.
-   string: ‘string’ is a regular-type table data format composed of words of the same size or of different sizes.
-   csv: ‘csv’ is a table data format of comma-separated values, as used for an Excel file.
-   r: ‘r’ is a file format that is configured in the ‘csv’ format and additionally includes a header and an ID for each line.

Specifically, in accordance with the RVR DBMS, the ‘string’, ‘csv’ and ‘r’ processors use a double indexing scheme because their data records may have the same or different sizes. That is, the RVR DBMS stores a map of each line and a map of each record size. However, since ‘int’ and ‘float’ data records all have the same size, random access to all records is possible with a single map.
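The difference between the single map and the double indexing scheme can be sketched in Python as follows. This is an illustrative assumption rather than the disclosed implementation: the names float_offset, build_double_index and csv_field are hypothetical, the table values are taken to be 8-byte doubles, and the csv text is assumed to be ASCII so that character offsets equal byte offsets.

```python
import struct

def float_offset(row, col, n_cols, item_size=8):
    """Single map: with fixed-size values the offset follows from arithmetic alone."""
    return (row * n_cols + col) * item_size

def build_double_index(text):
    """Double indexing for csv-like data: one map of line offsets, one map of field sizes."""
    line_map, size_map = [], []
    pos = 0
    for line in text.split("\n"):
        line_map.append(pos)
        size_map.append([len(field) for field in line.split(",")])
        pos += len(line) + 1  # +1 for the line break
    return line_map, size_map

def csv_field(text, line_map, size_map, row, col):
    """Line address + sizes of the preceding fields (and their commas) gives the field address."""
    start = line_map[row] + sum(size_map[row][:col]) + col
    return text[start:start + size_map[row][col]]

# Hypothetical 2 x 3 table of doubles, read without scanning.
table = struct.pack("<6d", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
value = struct.unpack_from("<d", table, float_offset(1, 2, n_cols=3))[0]

# Hypothetical csv text with fields of different sizes.
csv_text = "id,name,score\n1,alice,9.5\n22,bob,10"
line_map, size_map = build_double_index(csv_text)
```

For the fixed-size table no per-record information is stored at all, while the variable-size csv data needs both maps to reach a field without reading the fields before it.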

xml: ‘xml’ is used for a format of object-type structured data.

smtx: ‘smtx’ is a reduced form of an (N×N) matrix. For example, the part of each line of the (N×N) matrix that does not include information (i.e., the null part) is completely removed and the number of remaining data records is recorded; if as many records and sub-records as that number are then arranged, the size of the matrix can be minimized. The processor is able to use such data in the same manner as the full (N×N) matrix.
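The exact on-disk layout of ‘smtx’ is not specified here, so the following Python sketch only illustrates the general idea under stated assumptions: each row is reduced to a count followed by (column, value) pairs for its non-null cells, and a null cell is assumed to hold 0. The names to_smtx and smtx_get are hypothetical.

```python
def to_smtx(matrix, null=0):
    """Reduce each row to [count, (col, value), ...], dropping null cells."""
    reduced = []
    for row in matrix:
        cells = [(j, v) for j, v in enumerate(row) if v != null]
        reduced.append([len(cells)] + cells)
    return reduced

def smtx_get(reduced, i, j, null=0):
    """Read cell (i, j) as if the full (N x N) matrix were still present."""
    for col, value in reduced[i][1:]:
        if col == j:
            return value
    return null

# Hypothetical 3 x 3 matrix with mostly null cells.
full = [[0, 5, 0],
        [0, 0, 0],
        [7, 0, 9]]
small = to_smtx(full)
```

Here smtx_get returns the same value for every (i, j) as the original matrix, while only the three non-null cells and the per-row counts are stored.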

Indexes of irregular data and regular data will hereinafter be described in detail.

Each index (addressing) used in the RVR DBMS may use a dense index. In other words, a data address is always present for each data record, and a hashing table for a key or an ID of a data record is also used.

In accordance with the RVR DBMS, regular data is densely indexed down to all sub-data records. For irregular data, the dense indexing method is applied to the data records, and the sparse indexing method is applied to the sub-records.

FIG. 17 exemplarily shows a method for adding, deleting, updating, and searching for data using an RVR DBMS according to the present invention.

In the case of the RVR DBMS, since the size of a data record is not stored and only the address of the corresponding position is stored, ‘addition’ is always processed at the end of the entire set of data records, and ‘deletion’ is marked in the address information.

‘insertion’ has the same meaning as ‘addition’ because the access speed is invariable irrespective of the position of a data record.

‘update’ is performed through deletion of a data record, addition, and order readjustment (e.g., the case in which a specific order must be maintained, as in ‘insertion’), and also involves readjustment of the data record addresses on the disk.
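The add, delete and update behavior described above can be sketched with a small in-memory Python class. This is a simplified illustration under stated assumptions, not the disclosed implementation: the class name RecordStore is hypothetical, a bytearray stands in for the RVR file, and a list of [start, end, deleted] entries stands in for the RAT file.

```python
class RecordStore:
    """Toy store: a byte buffer stands in for the RVR file, a list for the RAT file."""

    def __init__(self):
        self.rvr = bytearray()
        self.rat = []  # one [start, end, deleted] entry per record

    def add(self, payload):
        """Addition always appends at the end; returns the new record number."""
        start = len(self.rvr)
        self.rvr += payload
        self.rat.append([start, start + len(payload), False])
        return len(self.rat) - 1

    def delete(self, rrn):
        """Deletion only marks the address entry; the bytes stay in place."""
        self.rat[rrn][2] = True

    def get(self, rrn):
        """Return the record bytes, or None if the record is marked deleted."""
        start, end, deleted = self.rat[rrn]
        return None if deleted else bytes(self.rvr[start:end])

    def update(self, rrn, payload):
        """Update = delete the old record, append the new one, and repoint the
        original entry so that the record order is kept."""
        self.delete(rrn)
        new_rrn = self.add(payload)
        self.rat[rrn] = self.rat.pop(new_rrn)
```

In this sketch no existing bytes are ever moved: an update leaves the old bytes in the buffer and only readjusts the address entry, which is why the cost of each operation is independent of the record position.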

A method for generating an RVR file and a RAT file for specific data (genome base sequence data and large-scale abstract data) using the RVR DBMS, and for searching/analyzing the data according to the embodiments of the present invention, will hereinafter be described in detail.

A method for processing genome base sequence data according to the embodiments of the present invention will hereinafter be described with reference to FIGS. 18 and 19. A method for processing large-scale abstract data according to the embodiments of the present invention will hereinafter be described with reference to FIGS. 20 and 21.

FIG. 18 is a flowchart illustrating a method for creating an RVR file and a RAT file of base sequence data by an RVR DBMS according to the present invention. FIG. 19 exemplarily shows a method for creating an RVR file and a RAT file of base sequence data by an RVR DBMS according to the present invention.

Referring to FIG. 18, if input data is genome base sequence data, the genome base sequence is fragmented into base units of a predetermined size at step S310. In this case, although the predetermined size may be established in various ways, it is assumed that 12 bases are exemplarily utilized in the embodiment of the present invention as shown in FIG. 19.

That is, as shown in FIG. 19, in association with the entire base sequence, 12 bases starting from the first base (i.e., the initial base) are discriminated. Then, 12 bases starting from the second base are discriminated. The discrimination operation is continued until the bases from the (the number of all bases−12)^(th) base to the last 12 bases have been sequentially discriminated.

Next, a unique number is assigned to each of the 12-base sequences separated from one another at step S320. In this case, each base of a discriminated base sequence is A, G, T or C, such that a unique number can be effectively assigned using a tetramal number.

In addition, each base sequence to which the unique number is assigned is indexed together with its number information and the position of each number, as shown in FIG. 19 (Step S330).

Next, the above-mentioned ‘smtx’-type RVR and RAT files are created from the indexed data, and the created RVR and RAT files are stored (Step S340).

In this case, the RVR file may correspond to the entire base sequence data. The RAT file stores the base data divided into 12-base units, the serial numbers added to the 12-base units, and the storage position of each base unit.

Meanwhile, if a user attempts to search for data using the RVR and RAT files and inputs a desired base unit (of 12 bases) as a query, the RVR DBMS according to the present invention searches for the input base unit in the RAT file, finds the data position including the base unit, and provides the user with the data at the found position. Needless to say, if an analysis command other than the search command is present, the analysis task is performed using the above-mentioned search result, and the result is provided to the user.
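The fragmenting, numbering, indexing and query steps can be sketched together in Python as follows. The sketch rests on assumptions made only for illustration: a dictionary stands in for the RAT file, the names to_number, build_index and search are hypothetical, the digit assignment A=0, C=1, G=2, T=3 follows the earlier description, and the short sample sequence is invented.

```python
from collections import defaultdict

BASES = "ACGT"  # A=0, C=1, G=2, T=3

def to_number(fragment):
    """Tetramal (base-4) number of a base fragment."""
    number = 0
    for base in fragment:
        number = number * 4 + BASES.index(base)
    return number

def build_index(sequence, k=12):
    """Slide a k-base window one base at a time; map each fragment number to its positions."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[to_number(sequence[pos:pos + k])].append(pos)
    return index

def search(index, query):
    """Return every position at which the query fragment occurs (empty list if none)."""
    return index.get(to_number(query), [])

# Hypothetical 20-base sequence; 0-based positions are used in this sketch.
genome = "ACGTACGTACGTACGTTTTT"
kmer_index = build_index(genome)
```

A query is answered by a single lookup on its tetramal number, so the search cost does not depend on where in the sequence the fragment lies.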

A method for mining large-scale abstract data using the RVR DBMS according to the embodiment of the present invention will hereinafter be described.

FIG. 20 is a flowchart illustrating a method for creating an RVR file and a RAT file of large-scale abstract data by an RVR DBMS according to the present invention. FIG. 21 is a conceptual diagram illustrating a method for creating an RVR file and a RAT file of large-scale abstract data by an RVR DBMS according to the present invention.

In this case, as shown in FIG. 21, large-scale abstract data is composed of a plurality of abstract data items that together constitute large-scale data.

If the input data is large-scale abstract data, a serial number is assigned to each abstract data item at step S410.

For each word contained in the abstract data, the serial number (RRN) of the data including that word is calculated at step S420.

Thereafter, each word, the serial number assigned to each word, and the number of words are configured in a hash table at step S430.

In this case, although steps S410 to S430 shown in FIG. 20 are illustrated as being performed separately from one another, steps S410 to S430 may be performed simultaneously. The RVR DBMS scans the abstract data from the start part to the end part, calculates the serial numbers (RRNs) of words that appear for the first time, and stores the calculated RRNs. For a word that has already appeared, the number information and the RRN are added to the existing entry, such that the entry for the repeated word is built up.

Thereafter, ‘smtx’-type RVR and RAT files are created from the data configured in a table format in step S430 (Step S440).

In this case, the RVR file may correspond to data in which the input large-scale abstract data is divided by serial number (abstract 1, abstract 2, . . . in FIG. 21). The RAT file stores individual words, the serial numbers (RRNs) of such words, and the number of stored words.

Meanwhile, the method for searching/analyzing data using the RVR file and the RAT file according to the embodiment of the present invention is performed on the same principle as for the aforementioned base sequence data. However, the RVR DBMS receives each word as a query, and performs an operation corresponding to the received query.
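The word table of steps S410 to S430 and the word query can be sketched in Python as follows. The sketch is illustrative only: a dictionary stands in for the hash table and the RAT file, the names build_word_table and search_word are hypothetical, words are split on white space and lower-cased as a simplifying assumption, and the sample abstracts are invented.

```python
def build_word_table(abstracts):
    """Map each word to [count, list of RRNs of the abstracts containing it]."""
    table = {}
    for rrn, text in enumerate(abstracts, start=1):  # S410: serial number per abstract
        for word in text.lower().split():
            entry = table.setdefault(word, [0, []])  # first appearance creates the entry
            entry[0] += 1                            # number information
            if not entry[1] or entry[1][-1] != rrn:  # S420: RRN of data including the word
                entry[1].append(rrn)
    return table

def search_word(table, abstracts, word):
    """Return the abstracts that contain the query word."""
    _count, rrns = table.get(word.lower(), [0, []])
    return [abstracts[rrn - 1] for rrn in rrns]

# Hypothetical abstracts.
docs = ["gene expression in yeast", "yeast genome", "protein folding"]
word_table = build_word_table(docs)
```

As noted above for steps S410 to S430, this sketch performs the numbering, RRN calculation and table construction in a single pass over the data.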

The detailed description of the exemplary embodiments of the present invention has been given to enable those skilled in the art to implement and practice the invention. Although the invention has been described with reference to the exemplary embodiments, those skilled in the art will appreciate that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention described in the appended claims. For example, those skilled in the art may use each construction described in the above embodiments in combination with each other.

Although the above-mentioned embodiment has exemplarily disclosed that the RVR DBMS of the present invention is used with an HDD, the present invention can be applied to a variety of storage media used as a substitute for an HDD. For example, the present invention can also be applied to either a solid state drive (SSD) (solid state disk) that uses a flash memory as a substitute for an HDD, or a Dynamic Random Access Memory (DRAM). In this case, the concepts of the RVR file and the RAT file are identical to those described above, and the SSD or DRAM must be interpreted as a substitute for the HDD.

The present invention relates to a method for creating/storing a file that facilitates a search operation, and a data search method using the same.

In recent times, a task for decoding the human genome sequences of 1000 people is being conducted by the United States NIH (http://www.1000genomes.org/). The amount of data alone is about 3 terabytes, and it is impossible for a standard DBMS to process data of about 3 terabytes.

In the Republic of Korea, in the Korean Association Resource (KARE)-I project of the Korea Centers for Disease Control and Prevention (KCDC) in 2007, the size of the single genome data is about 500 gigabytes. In KARE-II in 2008, a further 2 terabytes of similar data is created. In addition, it is impossible for a standard DBMS to create a database (DB) related to clinical epidemiology function information.

Therefore, when the present invention is applied to the task of storing/searching the latest data, which is growing into large-capacity data, the present invention has greater effects in economic efficiency and research execution speed.

For example, theoretically, a similarity (or homology) matrix of (100 K×100 K) data records is created. In order to perform exhaustive clustering of data using this matrix, the (100 K×100 K) matrix must normally be loaded in a DRAM. In this case, if the C/C++ program uses a double-precision variable (double) of 8 bytes, a DRAM of about 80 gigabytes (GB) is needed (10^5×10^5 entries×8 bytes=8×10^10 bytes).

Therefore, the RVR-RAT scheme that uses an HDD is absolutely required to research such large-volume clustering.

Although the RVR DBMS according to the present invention can be used as a DBMS having various purposes, the current RVR DBMS version can be most efficiently used as a method for analyzing/managing large-volume bulk data for science and technology. By means of some additional formats (a minimum formatting task), the RVR DBMS can be directly connected to a data process and an analysis module of a DBMS. In addition, according to the present invention, several files can be quickly DBMS-processed in the same manner as in each user's Web 2.0 personal computer (PC) acting as a server. Cloud computing means a service that enables many users to use analysis/calculation devices centralized in one place over the Internet. Such cloud computing is implemented by the present invention, such that a plurality of users can perform rapid calculation. The RVR DBMS can obtain the best application result from cloud computing. If highly integrated indexing of all data is performed, data can be quickly distributed. Parallel distributed calculation of such distributed data can be quickly processed using a large number of PC clusters, which is the best advantage of cloud computing technology.

As apparent from the above description, the file creating method for searching for single data and the method for searching for a single data file according to the embodiments of the present invention have the following effects.

In accordance with the Rack of Virtual RAM (RVR) serving as a binary file for use in the present invention, addresses of all data records on a hard disk are recorded in a RAT file. Therefore, a user randomly accesses RVR file record information using programming languages (Perl, Python, Fortran, C/C++, Java, etc.) together with the address information stored in the RAT file, formats the accessed information using such programming languages, and outputs the formatted result. Therefore, the embodiments of the present invention can create a database (DB) for large irregular data and can also analyze data.

In addition, the present invention can implement random access using a relatively cheap hard disk without the need for large amounts of DRAM, resulting in economic efficiency.

With the development of biotechnology, more than 2000 whole genome sequences ranging from microorganisms to animals and plants have been decoded, and a single human genome consumes about 3 gigabytes.

In the meantime, an RVR database management system (DBMS) according to the present invention performs DBMS operations using data records of data files and their addresses, whereas the conventional standard DBMS constructs a regular data table, inputs data to the table, and applies a DBMS to the input table. In addition, the RVR DBMS according to the present invention constructs a plurality of tables in the same manner as in the standard DBMS, and systematically applies a DBMS to the relationships between and within tables. Compared to the above-mentioned standard scheme, the RVR DBMS according to the present invention has advantages in that it constructs RVR-RATs for data records having different formats in different files and performs DBMS operations between and within files.

In the meantime, the RVR DBMS according to the present invention has the following advantages as compared to the conventional standard DBMSs (e.g., Main Memory based DBMS (MMDBMS), Disk resident DBMS (DRDBMS), and Hybrid DBMS (HDBMS)). There are differences among MMDBMS, DRDBMS and HDBMS. In more detail, MMDBMS enables a table present in a memory to be filled with data, DRDBMS enables a table present in a hard disk to be filled with data, and HDBMS stores data in a memory for rapid calculation and stably stores data on a disk. DRDBMS is preferable when dealing with large amounts of data, although its response time is slow. However, MMDBMS is better suited to small amounts of data. The above-mentioned two advantages of DRDBMS and MMDBMS are both present in HDBMS.

Compared to the standard DBMS, the RVR DBMS scheme has the following characteristics (1), (2), (3), (4) and (5).

(1) The RVR DBMS scheme uses only hard disk highly integrated index addresses of a data file composed of a specific format stored in a hard disk, such that the RVR DBMS scheme is identical to the DRDBMS scheme.

(2) The RVR DBMS scheme is similar to MMDBMS and has a rapid interaction speed.

(3) Specifically, the RVR DBMS scheme can be applied even to bulk data for science and technology. In addition, DBMS and analysis processes can also be easily applied to irregular data, such as a large number of genome sequences, that cannot be processed using the standard DBMS.

(4) The RVR DBMS scheme is used as a DBMS for data files, such that it can perform statistics and analysis calculation of data. Therefore, the RVR DBMS scheme can be utilized in a system for analyzing interaction data for use in science and technology.

(5) In addition, the RVR DBMS scheme performs highly integrated indexing of all files contained in a hard disk and manages the highly integrated result, such that it can easily distribute data. Such capability for easier data distribution of the RVR DBMS scheme can be more efficiently applied to cloud computing capable of easily performing distributed calculation.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the inventions. Thus, it is intended that the present invention covers the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

What is claimed is:
 1. A file creation method for searching for a single irregular data file, the method comprising: (A1) receiving a divisional unit of data as an input; (A2) discriminating the single irregular data file using the received divisional unit, and creating a rack of virtual RAM (RVR) file; (A3) detecting a record position for each divisional unit of the RVR file, and creating a record allocation table (RAT) file; and (A4) storing the RVR file and the RAT file.
 2. The method according to claim 1, wherein: the data is irregular data, and the divisional unit is any one of [page], [paragraph], [line] and [word].
 3. The method according to claim 1, wherein the record position is the size of data accumulated in the single data extended up to a specific position where corresponding data is recorded.
 4. The method according to claim 1, wherein the record position is a number of a hard disk cluster in which data of a corresponding part is recorded.
 5. A file creation method for searching for a single regular data file, the method comprising: (B1) discriminating between a row and a column of one regular data, and creating a rack of virtual RAM (RVR) file; (B2) detecting a record position for each row or column of the RVR file, and creating a record allocation table (RAT) file; and (B3) storing the RVR file and the RAT file.
 6. The method according to claim 5, wherein the record position is the size of data accumulated in the single data extended up to a specific position where corresponding data is recorded.
 7. The method according to claim 5, wherein the record position is a number of a hard disk cluster in which data of a corresponding part is recorded.
 8. The method according to claim 5, wherein the record position is the size of data accumulated in the single data extended up to a specific position where corresponding data is recorded.
 9. The method according to claim 5, wherein the record position is a number of a hard disk cluster in which data of a corresponding part is recorded.
 10. A method for searching for a single data file, the method comprising: (C1) receiving search information; (C2) detecting a record position contained in single data corresponding to searched information from a record allocation table (RAT) file; (C3) detecting a physical storage position contained in a storage medium of data corresponding to the searched information from the record position; and (C4) searching for data of a physical position of the data, and outputting the searched result.
 11. The method according to claim 10, wherein, if the single data is irregular data, the searched information is an order of each divisional unit.
 12. The method according to claim 10, wherein, if the single data is regular data, the searched information is a number of a row or column of corresponding data from among the regular data.
 13. The method according to any one of claim 10, wherein the record position is the size of data accumulated in the single data extended up to a specific position where corresponding data is recorded.
 14. The method according to claim 13, wherein the detecting step (C3) of the storage position includes: calculating a cluster position from the record position using a size of data of each divisional unit, reading a physical storage position of the cluster position from a file allocation table (FAT), and detecting the read physical storage position of the cluster position.
 15. The method according to any one of claim 10, wherein the record position is a number of a hard disk cluster in which data of a corresponding part is recorded.
 16. A method for searching for a data file, the method comprising: (D1) fragmenting a genome base sequence into predetermined-sized base units; (D2) allocating a unique number to each fragmented base unit; (D3) storing a storage position of each base unit; and (D4) creating a record allocation table (RAT) file.
 17. The method according to claim 16, wherein the unique number in the step (D2) is a tetramal number classified according to bases constructing the base unit.
 18. A method for searching for a data file, the method comprising: (E1) assigning a serial number to input data according to a divisional unit; (E2) calculating a serial number of data that includes a word contained in the input data; (E3) creating a hash table including the word and the serial number; and (E4) creating a record allocation table (RAT) using the hash table.